Journal of Computational Biology — Latest Matching Preprints

1

Towards a Unified Exact Solution of Rearrangement Small Parsimony for Natural Genomes

Bohnenkaemper, L.; Frolova, D.

2026-06-28 bioinformatics 10.64898/2026.06.23.733974 medRxiv

Top 0.1%

19.0%

Phylogenetic reconstruction is a fundamental problem in comparative genomics. As a theoretical problem in rearrangement studies, this has been modelled as the Small Parsimony Problem (SPP), in which ancestral genome structures have to be determined so as to minimize the number of rearrangement events occurring throughout the phylogeny. This problem is of significant interest in microbial and cancer genomics, due to the prevalence and clinical importance of rearrangement events. Genome structures in this problem are expressed as sequences of markers, which are themselves oriented sequence features (such as genes) that abstract from non-structural variations. Recent research has focused on the problem under the natural genomes model, in which arbitrary variations in copy number of markers are allowed. Natural genomes are often studied under the DCJ-indel model, a model which has already been successfully applied to plasmid data. There also exist ILP solutions to a variant of the Small Parsimony Problem under the DCJ-indel model. However, these solutions are limited in their applicability, as they make some critical simplifications for tractability purposes: ancestral marker frequencies and precomputed putative ancestral adjacencies, with their predicted likelihoods, are assumed as input. This creates multiple problems from both a theoretical and practical perspective. Firstly, this simplification means that the search covers not the full state space but only the subset of genomes with the precomputed putative adjacencies, so an optimal solution to the exact SPP is not guaranteed. Secondly, marker frequencies are given externally, without any theoretical guarantees. Thirdly, the method used to precompute adjacencies relies on gene trees, which requires the use of genes as markers, even though gene annotation is often unreliable, especially in regions with extensive rearrangement. 
Additionally, this restricts the applicability of the approach to sets of genomes that are both divergent and large enough to produce informative gene trees. This is, for example, rarely the case for plasmids, where nucleotide mutations are rarer than rearrangements and genomes are small. Hence, we revisit the problem to solve the exact SPP by introducing a cost to indel operations, which allows us to compute ranges of marker frequencies and derive theoretical results that allow us to reduce the solution space that the ILP searches without sacrificing optimality. We show that this makes the problem tractable for the case of small and recently related genomes, first on simulated genomes, and then on a set of pathogenic plasmids which represent a realistic use case for the method.

2

Reachability-Preserving Minimum Edge Cut Problem and Applications in Biology

Xie, J.; Duan, Q.

2026-06-03 bioinformatics 10.64898/2026.06.01.729192 medRxiv

Top 0.1%

18.7%

Biological pathway analysis often requires identifying interventions that block reachability to an undesirable state, such as a disease-associated module, toxic byproduct, or adverse phenotype, while preserving reachability among essential biological functions. Motivated by this setting, we study the Reachability Preserving Minimum Edge Cut (RPMEC) problem: given protected terminals s1 and s2 and a target terminal t, the goal is to remove a minimum-cost set of edges that separates s1 and s2 from t while keeping s1 and s2 connected. This formulation naturally models pathway-level intervention design, where one seeks to disrupt harmful signaling, metabolic, or interaction routes without breaking required functional connectivity. We revisit the three-terminal undirected edge-cut case and analyze a Dijkstra-style dynamic programming algorithm that is exact on planar graphs but fails on general graphs. We characterize the structural condition required for exactness, namely frontier-realizability of optimal source-side regions, and identify biological graph representations where this condition is likely to hold after appropriate preprocessing, including curated planar pathway maps, Reactome-style hierarchy trees, SCC-contracted feedback modules, metabolic building-block DAGs with dominator structure, and functional-module quotients of protein interaction networks. We further discuss directed variants, approximation strategies, and exact alternatives based on ASP, MILP, bounded-treewidth dynamic programming, and important separators. The results provide a graph-theoretic foundation for deciding when fast greedy computation is reliable for biological pathway intervention problems and when more expressive exact optimization methods are needed.

Author Summary: Many real-world networks require interventions that separate harmful or undesirable states while preserving essential connectivity. This situation appears in biological pathway analysis, where one may want to block reachability to a disease-related module, toxic byproduct, or adverse phenotype without disrupting communication among essential genes, proteins, reactions, or metabolites. We study this problem through the Reachability Preserving Minimum Edge Cut formulation. Unlike ordinary minimum cut, the solution must satisfy both a separation requirement and a preservation requirement. We show why a natural Dijkstra-style algorithm works only under specific structural conditions, such as planar, laminar, or module-like pathway graphs, and why it may fail on general graphs. The results help identify when fast graph algorithms are reliable for biological intervention problems and when exact optimization tools such as Answer Set Programming or integer programming are more appropriate.
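
To make the RPMEC definition above concrete, here is a minimal brute-force sketch in Python. It is not the Dijkstra-style algorithm or any of the exact methods discussed in the preprint; it simply enumerates every edge subset, so it is only usable on toy graphs. The function names, the dictionary-based edge-cost input, and the example graph are assumptions made for illustration.

```python
from itertools import combinations

def connected(nodes, edges, a, b):
    """Return True if a and b are joined by a path in the undirected graph."""
    adj = {v: set() for v in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, stack = {a}, [a]
    while stack:
        u = stack.pop()
        if u == b:
            return True
        for w in adj[u] - seen:
            seen.add(w)
            stack.append(w)
    return False

def rpmec_brute_force(nodes, cost, s1, s2, t):
    """Exhaustive RPMEC: cost maps an undirected edge (u, v) to its removal cost.

    Returns (total_cost, set_of_removed_edges), or None if no feasible cut exists.
    """
    edges = list(cost)
    best = None
    for r in range(len(edges) + 1):
        for cut in combinations(edges, r):
            kept = [e for e in edges if e not in cut]
            # Preservation: s1 and s2 stay connected.
            # Separation: s1 (and hence s2) cannot reach t.
            if connected(nodes, kept, s1, s2) and not connected(nodes, kept, s1, t):
                c = sum(cost[e] for e in cut)
                if best is None or c < best[0]:
                    best = (c, set(cut))
    return best

# Hypothetical toy pathway: the only s1-s2 route runs through node "a".
nodes = ["s1", "s2", "a", "t"]
cost = {("s1", "a"): 1, ("a", "s2"): 1, ("a", "t"): 3, ("s2", "t"): 1}
result = rpmec_brute_force(nodes, cost, "s1", "s2", "t")
```

In this toy graph an ordinary minimum cut separating s1 and s2 from t could remove the three unit-cost edges for a total of 3, but that would disconnect s1 from s2; the preservation requirement forces the more expensive cut of cost 4.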

3

Expanding gene regulatory networks from transcriptome data through graphical modeling with heterogeneous priors

Kokaji, T.; Suzuki, K. T.; Kunida, K.; Sakumura, Y.

2026-06-16 bioinformatics 10.64898/2026.06.12.731835 medRxiv

Top 0.1%

11.6%

Gene regulatory network inference is widely used to reconstruct large-scale networks and identify functional genes from transcriptome data. Meanwhile, in many biological fields, core regulatory genes have been extensively studied, leading to the establishment of small-scale gene regulatory networks, and novel genes connected to these networks remain to be identified. However, methods for expanding existing gene networks by identifying novel regulatory interactions, rather than reconstructing the entire network, are not well established. Here, we propose a method for gene network expansion that incorporates known regulatory relationships and evaluates each candidate gene individually to infer its regulatory connections to the existing network. Using simulated datasets from the DREAM4 benchmark and the PRECISE-1K experimental dataset, our method outperformed conventional methods by incorporating prior knowledge. In particular, it improved the ability to distinguish true regulatory interactions from indirect associations arising from strong correlations among genes in the existing network. The method also showed strong performance for interactions involving genes with high outdegree or centrality. Furthermore, it maintained stable performance as the size of the existing network increased and was robust to noise in prior information. These results demonstrate that our method provides an effective framework for expanding existing gene regulatory networks by leveraging prior knowledge.

4

Extended t-cores for the de novo identification of transposable elements and other inexact repeats from short read RNAseq data

Darmon, S.; Mary, A.; Lacroix, V.

2026-07-10 bioinformatics 10.64898/2026.07.06.736737 medRxiv

Top 0.1%

10.9%

Transcribed repeats represent a major challenge in the de novo assembly of transcriptomes from short RNA-seq reads. Young transposable elements (TEs) and other inexact repeats create dense and ambiguous regions in the assembly graph, preventing the correct assembly of transcripts. In this paper, we introduce a fully de novo method based on the discovery of dense regions in the compacted De Bruijn graph (DBG) to identify such repeats directly from short-read RNA-seq data, without requiring a reference genome or repeat database. Our approach defines extended t-cores, subgraphs of the DBG that capture the complex topology induced by highly expressed inexact repeats appearing in RNA-seq reads. Independently of its interest for transcriptome assembly, the proposed method appears to be effective for the de novo identification of repeats in transcriptomes. After classifying cores using sequence-based motifs to distinguish simple repeats from potential TEs, we demonstrate its potential for the de novo discovery of transposable elements. We validate the approach on a Mus musculus dataset using expressed TE consensus sequences, showing that extended t-cores correspond to known expressed TE families. We also illustrate its de novo discovery potential on a non-model species, Canis lupus familiaris, where the method also recovered known transposable elements.

5

Fast Set Operations for Compact k-mer Sets

Alanko, J.; Depuydt, L.; MARCHET, C.; Puglisi, S. J.

2026-05-27 bioinformatics 10.64898/2026.05.24.727514 medRxiv

Top 0.1%

9.7%

The k-mer spectrum of a set of sequences is the set of k-length substrings the sequences contain. This lossy representation of sequence content pervades modern genomics. Recently, the spectral Burrows-Wheeler transform (SBWT) has emerged as a space-efficient representation of k-spectra that also supports efficient k-mer lookup queries and, more generally, easy navigation of the de Bruijn graph of the k-spectrum. In this paper, we examine primitive set operations, such as intersection, union, and set difference, on SBWT-encoded k-spectra and show that these operations can be supported efficiently. Moreover, efficient merging leads directly to a new memory-efficient algorithm for SBWT construction, which was able to build the SBWT for the 661K bacterial dataset containing 88 billion distinct k-mers in 50 hours using 186 GiB of RAM and 112 GiB of disk space. Given the pervasiveness of k-mer sets in genomics and the continued rapid growth of genomic databases, our work opens the door to a wide array of future applications that manipulate and reason about genomic data by dealing directly with simultaneously compact and searchable k-mer set representations offered by the SBWT.
2012 ACM Subject Classification: Theory of computation → Design and analysis of algorithms
Digital Object Identifier: 10.4230/LIPIcs.WABI.2026.
Supplementary Material: Software (Source Code): https://github.com/LoreDepuydt/sbwt-set-operations
Funding: This work has benefited from funding from the French State under the France 2030 program, reference ANR-21-IDES-0006. The European Metropolis of Lille and the University of Lille are also acknowledged for their funding and support of the project WILL-CHAIRES-25-001-BOSSA.
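
As a reference point for what the set operations in this abstract compute, the sketch below builds k-mer spectra as plain Python sets and applies intersection, union, and difference. It says nothing about how the SBWT encodes or merges spectra; uncompressed sets are used only to pin down the semantics, and the function name and example sequences are assumptions.

```python
def kmer_spectrum(sequences, k):
    """Return the set of all k-length substrings of the given sequences."""
    spectrum = set()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            spectrum.add(seq[i:i + k])
    return spectrum

# Two toy spectra with k = 3.
a = kmer_spectrum(["ACGTAC"], 3)   # ACG, CGT, GTA, TAC
b = kmer_spectrum(["GTACGG"], 3)   # GTA, TAC, ACG, CGG

shared = a & b      # k-mers present in both spectra
combined = a | b    # k-mers present in either spectrum
only_a = a - b      # k-mers present in a but not in b
```

A compact representation that supports the same three operations directly would need to return exactly these sets, which makes a set-based version like this one a convenient correctness oracle on small inputs.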

6

Discriminative learning of substitution matrices and gap penalties for pairwise alignment of biological sequences

Ciach, M. A.; Zacharopoulou, E.; Startek, M. P.; Miasojedow, B.; Alexiou, P.

2026-05-18 bioinformatics 10.64898/2026.05.14.725168 medRxiv

Top 0.1%

9.3%

Pairwise alignment scores are used to classify pairs of sequences in many areas of bioinformatics, including homology search, predicting interactions, or read mapping. The relative scores of different pairs strongly depend on the choice of a substitution matrix and gap penalties, but the existing approaches for the estimation of these parameters do not directly optimize them for the task of classification. In this work, we present DiscrimAlign, a statistical model for discriminative learning of substitution matrices and gap penalties from a dataset of positive and negative pairs of unaligned biological sequences. The model links the alignment score of a sequence pair with the associated binary label through a logistic function and learns the parameters by likelihood maximization. We analyze theoretical properties of the model, derive and implement a learning procedure, study its performance in simulated experiments, and apply it to predict microRNA-target interactions. We show that sequence alignment with discriminative substitution matrices and gap penalties predicts the interactions comparably to state-of-the-art neural network classifiers while being more interpretable. An implementation of the model and reproducibility workflows are available at https://github.com/BioGeMT/DiscrimAlign.

7

Recursive exploration of metabolic yield space

Mores, W.; Bhonsale, S.; Floros, S.; Logist, F.; Van Impe, J. F. M.

2026-06-01 bioinformatics 10.64898/2026.05.28.728453 medRxiv

Top 0.1%

6.9%

Genome-scale metabolic network reconstructions contain extremely detailed and valuable information regarding cellular metabolism. Metabolic network analysis techniques exist for many applications, such as finding genetic engineering targets and constructing reduced kinetic models. Yield spaces based on the extreme rays of solution cones related to the metabolic network are frequently constructed for these types of analyses. However, for genome-scale networks, full enumeration of these extreme rays is not computationally feasible. In this work, a novel direct generation method for yield spaces is presented. This allows the application of many metabolic network analysis techniques to even the most recent genome-scale metabolic networks. Inspired by principles from multi-objective optimization algorithms, the proposed method performs highly efficient recursive exploration that is specifically adapted to the mathematical properties of yield spaces. Two case studies showcase both the efficiency of the method and its applicability for analysis of genome-scale metabolic networks.

8

BertST: BERT-based Spatial Domain Identification in Patient Data

Nnadi, G. O.

2026-07-09 bioinformatics 10.64898/2026.07.04.736527 medRxiv

Top 0.1%

6.7%

Spatial transcriptomics enables the study of gene expression within its native tissue context, providing critical insights into cellular organization and microenvironment-driven biological processes. A key challenge in this field is spatial domain identification, which aims to partition tissue into coherent regions by jointly leveraging gene expression and spatial information. Existing approaches are predominantly based on Graph Neural Networks (GNNs), and approaches based on Transformers, particularly the Bidirectional Encoder Representations from Transformers (BERT) model, for modelling both local and long-range dependencies remain largely unexplored. In this work, we propose BERT for Spatial Transcriptomics (BertST), a transformer-based framework that reformulates spatial transcriptomics as a graph-to-text representation learning problem. Building upon the BERTwalk paradigm, we construct a task-specific multi-graph representation integrating spatial adjacency, pruned gene-expression similarity, and a fully connected gene-expression graph. This design enables the modelling of both local spatial structure and global molecular relationships. Random walks over these graphs are treated as sequences, allowing a BERT model to learn contextualised node embeddings. To further enhance representation quality, we introduce a hierarchical multi-graph propagation strategy, where embedding refinement is performed sequentially: first on the fully connected graph to capture global structure, followed by the pruned graph to refine molecular relationships, and finally on the spatial graph to enforce local smoothness. This ordering ensures that global information is effectively distributed and progressively constrained by biologically meaningful neighbourhoods. We also improve computational efficiency by leveraging PecanPy, a fast and scalable implementation of node2vec, enabling efficient random walk generation on dense graphs. 
Experimental results on multiple 10x Visium datasets, including DLPFC and Human Breast Cancer, demonstrate that BertST consistently outperforms or matches GNN-based methods such as ConST, CCST, and SpaceFlow in terms of Adjusted Rand Index (ARI) and Adjusted Mutual Information (AMI). Overall, BertST highlights the potential of transformer-based architectures for spatial omics analysis by effectively capturing both local and long-range spatial-molecular dependencies, offering a promising alternative to traditional graph-based methods.

9

DivQuant: Estimation of Species Richness and Entropy from Small Samples

Schmitz, J. E.; Rahmann, S.

2026-06-11 bioinformatics 10.64898/2026.06.08.730836 medRxiv

Top 0.1%

5.7%

Estimating diversity properties of discrete distributions from a small observed sample is a fundamental problem in algorithmic statistics that has applications in many fields, in particular bioinformatics, but also in ecology or linguistics. The two most common diversity measures are the number of distinct elements in a multiset, also referred to as "species richness" in ecology or "alpha diversity" in microbial analysis, and the Shannon entropy, also referred to as "evenness". Estimating these properties from a small sample is particularly challenging for distributions with many rare elements. Thus, many estimators have been proposed in the past that, in practice, work well for different types of distributions. We present DivQuant, an optimization-based, extrapolating richness and entropy estimator with three contributions. First, we formulate the upsampling problem as a convex quadratic program with a Neyman χ² objective. Unlike the linear program of its predecessor RichnEst, DivQuant admits confidence intervals via χ² test inversion that are empirically well-calibrated. Second, we replace RichnEst's fixed-threshold fingerprint truncation with the rare/abundant fingerprint split of Valiant and Valiant, which strongly reduces problem size and preserves enough degrees of freedom for the confidence-interval program to remain valid and feasible. Third, we plug the optimal population fingerprint returned by the program into Shannon's entropy formula to obtain an entropy estimate. DivQuant attains close-to-nominal 95% confidence intervals in essentially all tested regimes, including six simulated distribution families, Tara Oceans microbiome data, and 10X Genomics scRNA-seq data, while competing state-of-the-art methods (RichnEst, iNext, PreSeq) miss the true richness in up to 80% of instances, well above the nominal 5%. In addition, DivQuant outperforms classical asymptotic entropy estimators (Miller-Madow, CAE) and the extrapolating iNext estimator. 
Running times remain competitive, with DivQuant typically completing in seconds. DivQuant is available as a command-line tool at https://gitlab.com/rahmannlab/divquant.
2012 ACM Subject Classification: Mathematics of computing → Probability and statistics; Mathematics of computing → Linear programming; Mathematics of computing → Quadratic programming; Applied computing → Bioinformatics
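
The step of plugging a fingerprint into Shannon's entropy formula can be illustrated with a short Python sketch. This is not DivQuant's quadratic program or its extrapolation; it only shows how richness and entropy follow from a given fingerprint, here taken to be a mapping from an abundance j to the number of species with exactly that abundance. The function name and the toy fingerprint are assumptions.

```python
import math

def richness_and_entropy(fingerprint):
    """Compute richness and Shannon entropy (in nats) from a fingerprint.

    fingerprint maps an abundance j to the number of species with abundance j.
    """
    richness = sum(fingerprint.values())
    total = sum(j * f for j, f in fingerprint.items())
    # Each of the f species with abundance j has relative frequency j / total.
    entropy = -sum(f * (j / total) * math.log(j / total)
                   for j, f in fingerprint.items())
    return richness, entropy

# Toy fingerprint: two species with abundance 1, one species with abundance 2.
richness, entropy = richness_and_entropy({1: 2, 2: 1})
```

For this toy input the relative frequencies are 0.25, 0.25, and 0.5, so the entropy equals 1.5 ln 2. Applied to the raw sample fingerprint this is the naive plug-in estimate, which tends to underestimate entropy when many species are rare; the abstract's approach applies the formula to an estimated population fingerprint instead.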

10

Understanding the bias of compositional microbiome differential abundance estimation

Calle, M. L.; Pujolassos, M.; Susin, A.

2026-04-30 bioinformatics 10.64898/2026.04.28.721392 medRxiv

Top 0.1%

5.6%

One of the most relevant objectives in microbiome studies is the identification of microbial species that are differentially abundant across conditions. However, the compositional nature of microbiome data complicates this task. Interdependence among components leads to spurious associations when the abundances of each component are analyzed separately. Due to the growing awareness of the challenges of compositional data analysis (CoDA), log-ratio transformations, such as the additive log-ratio (alr) or the centered log-ratio (clr) transformations, have become increasingly popular in microbiome studies. Several studies have compared the performance of compositional and non-compositional methods through simulations. However, the debate between these two frameworks remains unresolved, creating confusion among researchers. Rather than relying on simulation-based results, this work provides theoretical results that enable a more rigorous and conclusive analysis of the problem, contributing to a better understanding of differential abundance estimation. We provide theoretical expressions of the bias of differential abundance estimation related to the use of proportions (total sum scaling) and log-ratio transformations (alr and clr) when estimates are interpreted as absolute rather than relative to a reference. The factors that most strongly influence the bias are the magnitude and direction of the effects, the dimension of the composition, the proportion of differentially abundant variables, and the distribution of relative abundances. The findings of this work strongly support the use of CoDA transformations; however, they also highlight that even when log-ratio transformations are applied, interpreting the results outside of a CoDA framework can still lead to biased conclusions. 
Among CoDA transformations, alr has several advantages over clr: its reference is more explicit, which reduces the risk of interpreting estimates as absolute rather than relative, and it facilitates the replication of results in independent studies, as it only requires assessing changes relative to the same reference rather than reconstructing the full composition. In this work, we propose a heuristic method for selecting a suitable alr reference component, which will enable a more widespread use of this transformation.
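
The two log-ratio transformations compared in this abstract can be written in a few lines of Python. The sketch below uses the standard definitions of alr and clr and assumes strictly positive abundances (zero handling is out of scope); the function names and the toy abundance vectors are illustrative and not taken from the preprint.

```python
import math

def clr(x):
    """Centered log-ratio: log of each part minus the mean of the logs."""
    logs = [math.log(v) for v in x]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]

def alr(x, ref):
    """Additive log-ratio: log of each part relative to the reference part."""
    return [math.log(v / x[ref]) for i, v in enumerate(x) if i != ref]

# Toy absolute abundances: only the third taxon changes between conditions.
condition_a = [10.0, 10.0, 10.0]
condition_b = [10.0, 10.0, 40.0]

alr_shift = [b - a for a, b in zip(alr(condition_a, 0), alr(condition_b, 0))]
clr_shift = [b - a for a, b in zip(clr(condition_a), clr(condition_b))]
```

With the first taxon as the alr reference, `alr_shift` is zero for the unchanged second taxon and ln 4 for the third. `clr_shift` is nonzero for all three taxa, because the geometric-mean reference itself moves; reading those clr differences as absolute changes is the kind of misinterpretation the abstract warns about.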

11

linearPOA: A parallel, memory-efficient framework for Partial Order Alignment with linear space complexity

Wei, Y.; Huang, Z.; Zhang, P.; Tian, Q.; Li, Y.; Zou, Q.; Yu, L.

2026-04-30 bioinformatics 10.64898/2026.04.27.720899 medRxiv

Top 0.2%

5.4%

Multiple sequence alignment (MSA) is a fundamental problem in computational bioinformatics, playing a critical role in genome biology, especially in long-read sequencing and assembly. One solution for representing and solving MSA is Partial Order Alignment (POA), which employs Directed Acyclic Graphs (DAGs) to represent sequence relationships. However, when facing ultra-long, error-prone reads (e.g., >100 kbp), existing POA algorithms with quadratic space complexity become impractical due to excessive memory consumption. This paper introduces linearPOA, which uses a divide-and-conquer strategy to solve POA and aims to save memory compared with quadratic-space algorithms such as SPOA, abPOA, and TSTA. Particularly notable is its ability to reduce memory usage by up to 102.74-fold when aligning 100 kbp reads, compared with abPOA using non-heuristic methods. The algorithm was implemented within the linearPOA library, providing functionality for POA and foundational support for sequencing analysis, such as error correction for reads. linearPOA provides memory-efficient algorithms for long-read sequencing, especially for directly assembling long reads such as 100 kbp reads.
Availability: The linearPOA library is freely available at https://github.com/malabz/linearPOA, and the data underlying this article are available in Zenodo, at https://doi.org/10.5281/zenodo.15637837.
Supplementary information: Supplementary information is available at bioRxiv online.

12

ProtAug: An Empirical Investigation of pLM-Guided Data Augmentation for Protein Sequence Prediction Tasks

Chen, Z.; Wang, R.; Luo, Q.

2026-07-11 bioinformatics 10.64898/2026.07.10.737545 medRxiv

Top 0.2%

4.3%

Protein language models (pLMs) offer great potential for protein sequence analysis, yet the scarcity of labeled data often limits their effectiveness in fine-tuning. Data augmentation is a promising remedy, but systematic evaluation of augmentation strategies for protein sequences remains limited, and the conditions under which augmentation confers downstream benefits are not well understood. In this paper, we systematically investigate pLM-guided substitution-based augmentation across seven protein prediction tasks. We propose ProtAug, a framework that leverages encoder-based (ESM-2) and autoregressive (ProtGPT2) pLMs to generate augmented sequences with user-controlled variation levels. Our investigation focuses on four questions: (Q1) whether pLM-synthesized sequences preserve more original signals than simpler methods, (Q2) to what extent augmentation improves prediction performance, (Q3) how variation levels affect downstream accuracy across tasks and models, and (Q4) whether biological plausibility is a necessary condition for achieving improvement. Our experimental results show that: (1) ProtAug Esm generally preserves motifs and structural similarity better than simple substitution, often comparable to homology retrieval; (2) augmentation yields consistent but task-dependent improvements, with ProtAug Esm achieving the best or second-best performance in 5 out of 7 tasks at 10% variation; (3) low-to-moderate variation levels (2-30%) perform best overall, although high-variation augmentation can benefit certain structure-related tasks; (4) the necessity of biological plausibility is task- and variation-dependent--while semantic preservation correlates with performance at low-to-moderate variation levels, improved generalization at high variation levels suggests that regularization effects, rather than label preservation, can also drive performance gains.

13

DDTRN: Predicting Bacterial Transcriptional Regulatory Networks Based on Gene Sequences using Dual Descriptor

Nie, P.; Ma, B.-G.

2026-07-01 bioinformatics 10.64898/2026.06.30.735580 medRxiv

Top 0.2%

4.3%

Accurate computational reconstruction of bacterial transcriptional regulatory networks (TRNs) from sequence information alone remains a fundamental challenge in systems biology, particularly for non-model organisms lacking extensive transcriptomic data. We present DDTRN, a sequence-driven framework that formulates TRN inference as a binary classification task over concatenated regulator-target gene sequence pairs and employs a Dual Descriptor (DD) model to predict regulatory interactions. The DD architecture represents a sequence using two learnable components: a Composition Weight Map (CWM) and a Position Weight Function (PWF). We comprehensively evaluate DDTRN against six conventional machine learning baselines across eight benchmark bacterial datasets, including E. coli (DREAM5, RegulonDB), B. subtilis, S. enterica, C. glutamicum, M. tuberculosis, P. aeruginosa, and S. coelicolor. DDTRN achieves superior overall performance, attaining average AUROC and AUPR scores of 0.869 and 0.868, respectively, with particularly pronounced advantages at lower descriptor ranks where positional weighting compensates for limited sequence context. Systematic sensitivity analyses of rank, embedding dimension, and basis function count reveal stable optimal operating regimes, while subsampling experiments demonstrate strong robustness even with limited training data. Interpretability analyses show that PWF learns distinct periodic contributions across different rank granularities and that CWM preferentially weights meaningful k-mers. A case study on an E. coli dataset further illustrates that DDTRN identifies method-specific candidate targets complementary to those proposed by conventional approaches. 
By operating solely on genomic sequence, DDTRN provides a scalable, interpretable, and data-efficient framework for bacterial TRN inference in species where expression data are scarce, and it establishes a foundation for future multimodal integration with condition-specific regulatory information.

14

Statistical inference of the Tree of Blobs of a phylogenetic network from quartet concordance factors

Rhodes, J. A.; Allman, E. S.; Ane, C.; Banos, H.

2026-05-31 evolutionary biology 10.64898/2026.05.28.728501 medRxiv

Top 0.2%

4.2%

A phylogenetic network represents evolutionary relationships involving hybridization, gene flow, or admixture. While the full network may not be identifiable from genomic data under common coalescent models, its tree of blobs, depicting only the tree-like portions of the network structure, is. We introduce ECToBlob (Edge Contraction for Tree of Blobs), a new statistically consistent algorithm to estimate the tree of blobs from quartet concordance factors. Starting from a resolved tree, ECToBlob successively contracts edges which statistical tests indicate do not belong in the tree of blobs, due to reticulate or polytomous signal. We show that ASTRAL provides a valid starting tree under common assumptions, in that, asymptotically in the number of loci, trees optimizing ASTRAL's criterion refine the tree of blobs. We describe several algorithm variants, differing in how evidence from multiple tests is combined to determine if the edge should be contracted, and provide software implementations.

Relevance to Life Sciences: Hybridization, gene flow, and admixture are now recognized as important aspects of evolutionary history, but their genomic signal is confounded with that from a coalescent process, creating substantial challenges for inferring phylogenetic networks. The network's tree of blobs identifies areas where reticulation occurred, separated by tree-like branching. ECToBlob quickly estimates the tree of blobs using quartet concordance factors from gene trees, and provides a measure of statistical support for its result. Performance is illustrated through simulation and on empirical data, using an implementation in the R package MSCquartets. While the presence of a blob may be all that can be inferred in some cases, in others ECToBlob offers a robust and principled way to focus further analyses on more local reticulate structure.

Mathematical Content: This work makes contributions to mathematical phylogenetics in optimization, combinatorics, and statistics. First, we show that any tree maximizing quartet support (the criterion underlying ASTRAL) is a refinement of the network's tree of blobs under the coalescent model. Second, we give a concise proof that whether a network has a cut-edge corresponding to a given split is determined by information in certain subcollections of its 4-taxon subnetworks (quarnets). Finally, we propose valid statistical approaches for combining p-values across multiple quarnet hypothesis tests, proving that their use with specific decreasing test levels leads to statistically consistent inference as the number of loci grows.

MSC codes: 05C90, 60J95, 62-04, 62F07, 92D15

15

A unified smoothing framework for protein domain bigram model

Cui, X.; Iyer, G.; Durand, D.

2026-06-18 bioinformatics 10.64898/2026.06.14.732219 medRxiv

Top 0.2%

4.0%

Motivation: Biomolecular sequences can be represented as strings over an alphabet, an analogy that has motivated many applications of computational linguistic techniques to biological problems. However, such methods must be adapted to the characteristic scale and organization of biomolecular data. Here, we consider the problem of bigram smoothing for multidomain protein architectures, where domain bigram frequency data is extremely sparse and differs from textual data in alphabet size, string length distribution, the relationship between bigram and unigram frequencies, tandem repeat lengths, and the distribution of domain adjacencies. Moreover, some domain combinations are unobserved because they are biologically incompatible, others because the data are incomplete. A smoothing method that distinguishes these two cases is required.
Results: We propose a unified smoothing framework based on interpolation that can be tuned to accommodate different bigram data characteristics. Within this framework, we design specific model variants suited to protein domain bigram data: these assign low adjusted counts to pairs that are likely incompatible, while making appropriate adjustments for undersampled pairs. We demonstrate empirically that this approach distinguishes the two cases while preserving the characteristic signatures of multidomain data.
Availability and implementation: Implementations of smoothing methods, the scripts used to generate all results presented in this paper, and the curated lists of extracellular and DNA-binding domains are available at https://codeberg.org/xcui297/protein-domain-smoothing.
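
For readers unfamiliar with interpolation-based smoothing, the following Python sketch shows the textbook fixed-weight form applied to domain architectures: the bigram probability is a weighted mix of the maximum-likelihood bigram estimate and the unigram frequency. It is a generic baseline, not the unified framework or any model variant from the preprint; the function name, the weight `lam`, and the example architectures are assumptions.

```python
from collections import Counter

def interpolated_bigram(architectures, lam=0.7):
    """Return prob(a, b): fixed-weight interpolation of bigram and unigram estimates."""
    unigrams, bigrams, context = Counter(), Counter(), Counter()
    for arch in architectures:
        unigrams.update(arch)
        for a, b in zip(arch, arch[1:]):
            bigrams[(a, b)] += 1
            context[a] += 1   # times domain a is followed by some domain
    total = sum(unigrams.values())

    def prob(a, b):
        p_uni = unigrams[b] / total
        if context[a] == 0:
            return p_uni      # a never precedes anything: fall back to unigrams
        p_ml = bigrams[(a, b)] / context[a]
        return lam * p_ml + (1 - lam) * p_uni

    return prob

# Hypothetical multidomain architectures, written as ordered lists of domain names.
architectures = [["SH3", "SH2", "Kinase"], ["SH2", "Kinase"]]
prob = interpolated_bigram(architectures, lam=0.7)
p_seen = prob("SH2", "Kinase")   # observed adjacency
p_unseen = prob("SH2", "SH3")    # never observed in this order
```

Note that this baseline gives every unobserved pair a probability proportional to the second domain's unigram frequency, so it cannot tell a biologically incompatible pair from an undersampled one, which is the distinction the abstract identifies as necessary.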

16

inGSEA: An Improved Method for Gene Set Enrichment Analysis Using a Weighted Integral Statistic

Zhang, Q.; Li, Q.

2026-06-05 bioinformatics 10.64898/2026.06.02.729106 medRxiv

Top 0.2%

4.0%

Gene Set Enrichment Analysis (GSEA) is one of the most popular methods for transcriptomic analysis, yet its statistical power is limited when the biological pathways exhibit heterogeneous or non-concordant expression patterns. We propose an improved GSEA method, integral-based GSEA (inGSEA). inGSEA introduces a novel enrichment score based on the Anderson-Darling weighted integral statistic. The new enrichment score enhances detection power for complex signals, particularly sparse and bidirectional ones, while the Cauchy combination of integral and classic maximum statistics provides robustness across diverse expression patterns. Extensive numerical studies demonstrate that inGSEA achieves superior power and well-calibrated false discoveries. Application to real-world datasets reveals biologically relevant pathways missed by the standard GSEA. inGSEA reduces the computational burden of permutation testing by employing a generalized gamma distribution to approximate the null distribution. inGSEA is accessible as a user-friendly web-based software tool (https://amss-stat.github.io/inGSEA).
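The Cauchy combination mentioned above can be illustrated with a short sketch of the standard Cauchy combination test, which maps each p-value to a Cauchy-distributed quantity, averages, and maps back. The abstract does not say how inGSEA weights the integral and maximum statistics, so the equal weights and the function name `cauchy_combine` are assumptions for illustration only.

```python
import math


def cauchy_combine(pvalues, weights=None):
    """Combine p-values with the Cauchy combination test.

    Hypothetical helper. Weights are assumed non-negative and summing to 1;
    equal weights are used when none are given.
    """
    if weights is None:
        weights = [1.0 / len(pvalues)] * len(pvalues)
    # Each p-value is transformed to a standard Cauchy variate under the null.
    statistic = sum(
        w * math.tan((0.5 - p) * math.pi) for w, p in zip(weights, pvalues)
    )
    # Upper tail probability of the standard Cauchy distribution.
    return 0.5 - math.atan(statistic) / math.pi


# Made-up p-values standing in for the integral and maximum statistics.
print(cauchy_combine([0.003, 0.2]))
```

With a single p-value the function returns it unchanged, and a very small p-value dominates the combined result, which is why this kind of combination stays sensitive when only one of the two statistics detects a signal.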

17

GTRspmix: Capturing Heterogeneity of Exchangeabilities Across Sites to Improve Protein Phylogenetics

Harada, R.; Susko, E.; Wong, T. K. F.; Banos, H.; Ly-Trong, N.; Lanfear, R.; Theobald, D. L.; Minh, B. Q.; Roger, A. J.

2026-06-18 evolutionary biology 10.64898/2026.06.18.729217 medRxiv

Top 0.2%

4.0%

Site rate and profile mixture models capture the heterogeneity of the amino acid substitution process across sites. However, these models typically use a single matrix of amino acid exchangeabilities and ignore potential heterogeneities of these exchangeabilities across sites. Simply combining multiple exchangeability matrices with rate and profile mixtures leads to a combinatorial explosion of mixture components and a prohibitive increase in free parameters. Here, we introduce GTRspmix, a novel framework that incorporates multiple exchangeability matrices into profile and site rate mixture models while effectively managing model complexity. GTRspmix employs a clustering-based strategy that groups profiles and assigns a distinct exchangeability matrix to each profile cluster. Evaluations using both empirical and simulated datasets demonstrate that GTRspmix fits empirical data significantly better than conventional models, and that overparameterization does not present a problem for sufficiently large alignments. Based on these results, we estimated general-purpose empirical models (SXXpfamCYY series available in IQ-TREE3) from the Pfam database. These general-purpose models not only fit data much better, but they also influence branch length and tree topology estimates, effectively mitigating long-branch attraction artifacts. Because the total number of rate matrices remains manageable, the computational efficiency of the inference is identical to that of conventional profile mixture models (e.g., LG+C60+G4). GTRspmix provides a more realistic and flexible model of protein evolution, offering a robust foundation for the inference of reliable phylogenetic trees.

18

GRNPred: A Multimodal Graph Transformer with Masked Gene Expression Pretraining for Gene Regulatory Network Inference

Nguyen, T. M.; Hegde, A.; Cheng, J.

2026-04-29 bioinformatics 10.64898/2026.04.26.720917 medRxiv

Top 0.2%

4.0%

Gene regulatory network (GRN) inference is a fundamental problem in systems biology, aiming to identify transcription factor (TF)-target gene interactions from high-dimensional gene expression data. Accurate GRN reconstruction remains challenging due to limited labeled regulatory data, severe class imbalance, and the complex, nonlinear nature of transcriptional regulation. Here, we introduce GRNPred, a multimodal graph transformer framework for robust GRN inference that integrates gene expression, functional annotations, semantic gene descriptions, regulatory binding motif priors, and gene co-expression network topology. GRNPred follows a two-stage training strategy. In the first stage, self-supervised pretraining, a graph transformer encoder is trained on TF-centered gene co-expression subgraphs using masked gene-expression reconstruction, enabling the model to learn transcriptional context from unlabeled data. In the second stage, supervised fine-tuning, the pretrained encoder is fine-tuned for TF-target edge prediction using available regulatory annotations. Transformer-based attention allows GRNPred to capture long-range and context-dependent regulatory interactions that are difficult to model with conventional graph neural networks. Extensive evaluation across 7 benchmark datasets and 3 regulatory network constructions demonstrates that GRNPred consistently outperforms state-of-the-art GRN inference methods, achieving AUROC scores of up to 0.94 and AUPRC scores of up to 0.93, while maintaining strong robustness across diverse biological contexts.

19

Does ensembling improve feature attributions from sequence-to-activity models?

Maslova, A.; Libbrecht, M.

2026-07-13 bioinformatics 10.64898/2026.07.08.737315 medRxiv

Top 0.3%

3.4%

Sequence-to-activity (S2A) models take DNA sequence as input and predict genomic activities such as transcription factor binding and gene expression. Applying explainable AI (xAI) methods such as DeepLIFT to these models has recently led to breakthroughs on many genomic problems, including deciphering transcription factor binding grammar and predicting the effects of genetic variants. However, there remains significant uncertainty about the reliability of sequence-to-activity interpretations. Thus, we need accurate probabilistic measures of confidence to distinguish reliable from unreliable interpretations. Towards this end, researchers have recently aimed to characterize variability across ensembles of S2A models. However, previous work has focused on using model ensembles to improve the model predictions themselves. Here, we aim to evaluate whether model ensembles can also be used to improve feature attributions from post-hoc xAI methods. We find that ensembling attributions from multiple models improves downstream applications, including identifying transcription factor motifs and predicting regulatory genetic variants. We show that forming an ensemble using Monte Carlo Dropout (MCDropout) approaches, but does not match, the performance of training multiple models, at much lower train-time computational cost.
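As a concrete picture of what ensembling attributions can mean, the sketch below averages per-position attribution scores across models. The abstract does not state how attributions are combined, so the plain mean, the function name `ensemble_attributions`, and the toy scores are assumptions for illustration; in practice the inputs would come from an attribution method such as DeepLIFT run on each model.

```python
def ensemble_attributions(per_model_scores):
    """Average per-position attribution scores across models.

    Hypothetical helper: `per_model_scores` is a list with one entry per
    model, each a list of floats with one score per sequence position.
    """
    if not per_model_scores:
        raise ValueError("need attributions from at least one model")
    length = len(per_model_scores[0])
    if any(len(scores) != length for scores in per_model_scores):
        raise ValueError("all models must score the same sequence positions")
    n_models = len(per_model_scores)
    return [
        sum(scores[i] for scores in per_model_scores) / n_models
        for i in range(length)
    ]


# Made-up attribution scores from three models over a 4-bp sequence.
scores = [
    [0.1, 0.9, 0.0, -0.2],
    [0.3, 0.7, 0.1, -0.4],
    [0.2, 0.8, -0.1, -0.3],
]
print(ensemble_attributions(scores))
```

The same function would apply to an MCDropout ensemble if each entry held the attributions from one stochastic forward pass rather than from a separately trained model.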

20

Characterizing the fragmentation of AlphaFold predictions

Sarti, E.; Cazals, F.

2026-06-05 bioinformatics 10.64898/2025.12.19.695436 medRxiv

Top 0.3%

3.3%

The Nobel Prize-winning program AlphaFold2 computes plausible structures of well-folded proteins. The main quality assessment is based on the predicted Local Distance Difference Test (pLDDT), a per-amino-acid confidence score. To enhance quality assessment, we provide novel quantitative measures to identify coherent amino acid (a.a.) stretches along the sequence in terms of pLDDT values. These constructions, grounded in standard techniques from topological data analysis and combinatorics, provide a canonical framework for identifying regions along the protein backbone and analyzing their properties, such as their propensity for disorder and their consistency with a null model. The outcome of our analysis can readily be used to select reliable regions/domains within proteins whose pLDDT values span the entire pLDDT range.
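For contrast with the approach described above, here is a sketch of the naive baseline it improves on: cutting the sequence into stretches with a single fixed pLDDT threshold. This is not the authors' topological method. The function name `confident_stretches`, the cutoff of 70, and the toy scores are assumptions for illustration.

```python
def confident_stretches(plddt, threshold=70.0, min_len=1):
    """Return half-open (start, end) index ranges where pLDDT >= threshold.

    Hypothetical helper implementing plain fixed-threshold segmentation;
    stretches shorter than `min_len` residues are discarded.
    """
    stretches = []
    start = None  # index where the current stretch began, if any
    for i, score in enumerate(plddt):
        if score >= threshold:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_len:
                stretches.append((start, i))
            start = None
    # Close a stretch that runs to the end of the sequence.
    if start is not None and len(plddt) - start >= min_len:
        stretches.append((start, len(plddt)))
    return stretches


# Made-up per-residue pLDDT values.
print(confident_stretches([90, 85, 40, 75, 80, 95]))
```

A single cutoff makes the result depend heavily on the chosen value, which is one reason a threshold-free description of stretches across the whole pLDDT range is attractive.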